Deep Learning Algorithms for the Detection of Suspicious Pigmented Skin Lesions in Primary Care Settings: A Systematic Review and Meta-Analysis

Early detection of suspicious pigmented skin lesions is crucial for improving the outcomes and survival rates of skin cancers. However, the accuracy of clinical diagnosis by primary care physicians (PCPs) is suboptimal, leading to unnecessary referrals and biopsies. In recent years, deep learning (DL) algorithms have shown promising results in the automated detection and classification of skin lesions. This systematic review and meta-analysis aimed to evaluate the diagnostic performance of DL algorithms for the detection of suspicious pigmented skin lesions in primary care settings. A comprehensive literature search was conducted using electronic databases, including PubMed, Scopus, IEEE Xplore, Cochrane Central Register of Controlled Trials (CENTRAL), and Web of Science. Data from eligible studies were extracted, including study characteristics, sample size, algorithm type, sensitivity, specificity, diagnostic odds ratio (DOR), positive likelihood ratio (PLR), negative likelihood ratio (NLR), and receiver operating characteristic curve analysis. Three studies were included. The results showed that DL algorithms had a high sensitivity (90%, 95% CI: 90-91%) and specificity (85%, 95% CI: 84-86%) for detecting suspicious pigmented skin lesions in primary care settings. Significant heterogeneity was observed in both sensitivity (p = 0.0062, I² = 80.3%) and specificity (p < 0.001, I² = 98.8%). The pooled DOR was 26.39, indicating strong overall diagnostic performance. The PLR was 4.30, meaning that a positive result raised the odds that a lesion was suspicious roughly fourfold. The NLR was 0.16, meaning that a negative result substantially lowered those odds. The area under the curve of DL algorithms was 0.95, indicating excellent discriminative ability in distinguishing between benign and malignant pigmented skin lesions.
DL algorithms have the potential to significantly improve the detection of suspicious pigmented skin lesions in primary care settings. Our analysis showed that DL exhibited promising performance in the early detection of suspicious pigmented skin lesions. However, further studies are needed.


Introduction And Background
Skin cancer is one of the most common cancers worldwide, with an estimated 1.5 million new cases diagnosed each year [1]. Among the different types of skin cancer, melanoma is the most aggressive tumor, accounting for approximately 75% of skin cancer-related deaths [2]. The burden of skin cancers is expected to increase further due to population growth, aging, and increased exposure to ultraviolet radiation [1]. Early detection and diagnosis of skin cancer, particularly melanoma, is crucial for improving patient outcomes and survival rates. The five-year survival rate for melanoma patients with localized disease is over 99%, but it drops dramatically to 74% for regional disease and 35% for distant metastasis [3]. A significant proportion of skin cancers are diagnosed at advanced stages, leading to worse prognoses and increased healthcare costs [4].
Primary care physicians (PCPs) are often the first point of contact for patients with suspicious skin lesions. In primary care settings, visual inspection of patients to identify suspicious pigmented lesions is a well-established dermatological practice. The ABCDE criteria, which include asymmetry, border unevenness, color distribution, diameter, and evolution, have been traditionally used for early-stage melanoma screening by PCPs [5]. However, recent research suggests that highly accurate and skilled clinical detection of melanoma relies more on unconscious visual pattern recognition and "ugly duckling" comparisons than on the simplified ABCDE criteria. Moreover, the accuracy of clinical diagnosis by PCPs is suboptimal, with studies reporting sensitivities ranging from 55% to 73% [6,7]. This can result in unnecessary referrals and biopsies, causing anxiety and stress for patients and increasing the burden on healthcare systems [8]. Therefore, there is a pressing need for more accurate and efficient diagnostic tools to aid PCPs in the early detection and diagnosis of skin cancers.
In recent years, artificial intelligence (AI) has shown great promise in improving the diagnosis and management of various medical conditions, including skin cancers [9]. Deep learning (DL) is a subset of machine learning (ML) that uses artificial neural networks to model and solve complex problems. These algorithms can automatically learn and extract relevant features from large datasets, enabling them to identify patterns and make predictions with high accuracy [10]. In the context of skin cancer, DL algorithms have been applied to dermoscopic images, providing automated and objective analysis that improves diagnostic accuracy compared with naked-eye examination alone. Several studies have demonstrated the efficacy of DL algorithms in detecting and classifying skin lesions, including melanoma, with high sensitivity and specificity [11]. Nonetheless, these studies evaluated DL algorithms on dermoscopic images, whose acquisition requires specialized training and expertise that are not readily available in primary care settings [11].
Meanwhile, advances in smartphone technologies have increased access to high-quality personal cameras and robust mobile computing systems, which have been applied to dermatology using computer-aided diagnosis (CAD) systems [12]. Recent studies have evaluated the usefulness of employing DL algorithms in next-generation CAD systems for the evaluation of suspicious pigmented lesions in primary care settings. These studies have shown that DL-based models can achieve comparable or even superior diagnostic accuracy to board-certified dermatologists in visual inspection [13,14]. However, the diagnostic performance of DL algorithms for the detection of suspicious pigmented skin lesions in primary care settings has not been systematically evaluated.
Therefore, this systematic review and meta-analysis aimed to evaluate the diagnostic accuracy of DL algorithms for the detection of suspicious pigmented skin lesions in primary care settings.

Review Methods
The present systematic review and meta-analysis was prepared in accordance with the 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [15].

Eligibility Criteria and Screening
We included all studies that utilized DL algorithms for the detection of suspicious pigmented skin lesions in primary care settings. We limited our inclusion to studies that reported sufficient data to calculate the diagnostic accuracy of DL algorithms, including true positives, false positives, true negatives, and false negatives. Studies that were non-English, case reports, case series, review articles, editorial comments, letters, and studies that focused on the use of DL algorithms outside of the primary care setting or for nonpigmented lesions were excluded.
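The following is a minimal sketch of why the four cell counts are sufficient: every accuracy measure pooled in this review can be derived from them. The class name, function name, and the counts themselves are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 table of the index test (DL algorithm) against the reference standard."""
    tp: int  # suspicious lesions flagged as suspicious
    fp: int  # benign lesions flagged as suspicious
    tn: int  # benign lesions flagged as benign
    fn: int  # suspicious lesions flagged as benign


def accuracy_measures(c: ConfusionCounts) -> dict[str, float]:
    """Derive per-study accuracy measures from the four cell counts."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    plr = sensitivity / (1 - specificity)   # positive likelihood ratio
    nlr = (1 - sensitivity) / specificity   # negative likelihood ratio
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "plr": plr,
        "nlr": nlr,
        "dor": plr / nlr,  # algebraically equal to (tp * tn) / (fp * fn)
    }


# Hypothetical counts for illustration only.
example = ConfusionCounts(tp=90, fp=30, tn=170, fn=10)
measures = accuracy_measures(example)
```

The sketch divides by zero when a study has no false positives or no false negatives; a continuity correction (for example, adding 0.5 to each cell) is a common workaround in that case.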
A two-step screening process was conducted to identify studies that met the predefined inclusion criteria. First, two reviewers independently screened the titles and abstracts of all retrieved studies to identify potentially eligible studies. Then, the full texts of the potentially eligible studies were obtained and assessed independently by the same two reviewers to determine their eligibility for inclusion. Disagreements between the reviewers were resolved by discussion to reach a final decision. If a consensus could not be reached, a third reviewer was consulted to make a final decision.
Data extraction was performed systematically to extract relevant information from the included studies. The data extraction process was conducted by three reviewers who independently extracted data from each included study and recorded it in a predesigned Excel sheet (Microsoft Corporation, Redmond, WA). The extracted data included authors, year of publication, country, study design, demographic characteristics of the population, type of the lesion investigated, DL algorithm characteristics, and their sensitivity, specificity, positive predictive value, negative predictive value, and area under the receiver operating characteristic (ROC) curve.
The risk of bias assessment of the included studies was conducted using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool to evaluate the methodological quality and applicability of the included primary diagnostic accuracy studies [16]. The risk of bias was assessed using four domains: patient selection, index test, reference standard, and flow and timing. Studies were rated as "low risk of bias," "high risk of bias," or "unclear risk of bias."

Data Analysis
The statistical analysis was conducted using the meta package in R software version 4.1.2 (R Foundation for Statistical Computing, Vienna, Austria). We performed a bivariate random-effects meta-analysis model to synthesize sensitivity, specificity, positive likelihood ratio (PLR), negative likelihood ratio (NLR), and diagnostic odds ratios (DOR) from the included studies. The diagnostic performance across studies was summarized using the summary receiver operating characteristic (SROC) curve, along with the calculation of the area under the curve (AUC). Heterogeneity among the included studies was assessed using the I² statistic, where values of 25%, 50%, and 75% represent low, moderate, and high heterogeneity, respectively. Cochran's Q test was also used to evaluate the statistical significance of observed heterogeneity. A p-value of less than 0.10 was considered to indicate significant heterogeneity.
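To make the heterogeneity statistics concrete, the sketch below computes Cochran's Q, I², and a pooled log DOR. It is not the bivariate model fitted with R in this review: it substitutes a simpler univariate DerSimonian-Laird random-effects estimator, always applies a 0.5 continuity correction, and uses hypothetical function names and study counts.

```python
import math


def log_dor(tp, fp, tn, fn):
    """Log diagnostic odds ratio and its variance (0.5 added to every cell)."""
    a, b, c, d = (x + 0.5 for x in (tp, fp, fn, tn))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def random_effects(effects, variances):
    """Univariate DerSimonian-Laird pooling with Cochran's Q and I² (in %)."""
    w = [1 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance
    w_re = [1 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {"pooled": pooled, "se": se, "q": q, "i2": i2, "tau2": tau2}


# Hypothetical 2x2 counts (tp, fp, tn, fn) for three studies; illustration only.
studies = [(90, 30, 170, 10), (45, 10, 120, 8), (200, 90, 400, 15)]
effects, variances = zip(*(log_dor(*s) for s in studies))
summary = random_effects(effects, variances)
pooled_dor = math.exp(summary["pooled"])  # back-transform to the DOR scale
```

Q is compared against a chi-square distribution with k - 1 degrees of freedom (k being the number of studies) to obtain the heterogeneity p-value.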

Literature Search Results
A total of 473 unique records were identified through the literature search. A total of 202 records were excluded during the initial screening phase, and 271 full-text articles were assessed for eligibility. Of them, three studies were included in the present systematic review and meta-analysis (Figure 1) [12,14,17].

Characteristics of the Included Studies
Three studies employed DL techniques to detect pigmented skin lesions in primary care settings. Sangers et al. conducted a prospective diagnostic accuracy study using a convolutional neural network (CNN, version RD-174). The study involved 372 participants with a median age of 71 years (range = 58-78), and approximately half were males (49.2%). Out of the 785 skin lesions assessed, 275 were identified as premalignant or malignant and 510 as benign. The main finding was that the AI-powered, commercially available application exhibited high sensitivity and specificity in detecting suspicious lesions [17]. Birkenfeld et al. also carried out a prospective diagnostic accuracy study but used logistic regression combined with principal component analysis (PCA). The sample included 133 individuals, predominantly male (54.13%), with ages ranging from 16 to 76 years. The study analyzed a total of 1,759 lesions, distributed between a training set of 1,187 lesions and a test set of 572 lesions. The results highlighted that the DL-powered CAD system could efficiently differentiate between suspicious and non-suspicious lesions [12]. Soenksen et al. utilized a deep convolutional neural network (DCNN) in a retrospective diagnostic accuracy study involving a large dataset of 33,980 images, which included 4,063 suspicious pigmented lesions (SPLs). Although specific demographic details were not provided, the study concluded that the DL-powered tool enabled accurate detection of SPLs within a primary care setting [14], as shown in Table

Quality Assessment
Birkenfeld et al. [12] showed a high risk of patient selection bias, while the other two studies had a low risk of bias in this domain. All studies demonstrated a low risk of bias in the index test, reference standard, and flow and timing domains (Figure 2).
Figure 4A shows the findings of the DOR from three studies. The individual diagnostic ORs ranged from 15.78 to 82.88. When combined using a random-effects model, the pooled DOR was 26.39 (95% CI: 6.79-102.63), indicating a substantial overall diagnostic performance. However, the studies had significant heterogeneity (p < 0.001, I² = 98.2%). The SROC curve showed that the pooled AUC was 0.9563, indicating high discriminatory power (Figure 4B). OR: odds ratio; SROC: summary receiver operating characteristic curve; ROC: receiver operating characteristic; AUC: area under the curve.
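The wide interval around the pooled DOR is easier to read on the log scale, where odds ratios are usually pooled. The short sketch below assumes the reported interval is a standard Wald interval on the log scale, which is an assumption rather than something stated in the text, and recovers the implied standard error; the variable names are illustrative.

```python
import math

# Pooled DOR and 95% CI as reported in the text.
dor, lower, upper = 26.39, 6.79, 102.63
z = 1.959964  # standard normal quantile for a two-sided 95% interval

# On the log scale a Wald interval is symmetric around the point estimate,
# so the midpoint of the log limits should reproduce the pooled estimate.
log_mid = (math.log(lower) + math.log(upper)) / 2
implied_dor = math.exp(log_mid)

# Half the width of the log-scale interval divided by z gives the
# standard error of the pooled log DOR.
se_log_dor = (math.log(upper) - math.log(lower)) / (2 * z)
```

Under that assumption the midpoint reproduces the reported estimate to within rounding, and the implied standard error of roughly 0.69 on the log scale reflects how little precision three heterogeneous studies provide.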

Discussion
Our research aimed to investigate and shed light on the effectiveness of DL algorithms in the early diagnosis of patients suffering from suspicious pigmented skin lesions in primary care settings. We compared the performance of these advanced technologies with the usual diagnostic methods followed by doctors to gain a comprehensive understanding of the potential benefits of DL algorithms. The pooled analysis revealed the high sensitivity and specificity of the DL algorithms for detecting suspicious pigmented skin lesions in primary care settings. Our data also suggested that the DL algorithms increased the odds of correctly diagnosing suspicious pigmented skin lesions, with an excellent discriminative ability to distinguish between benign and malignant pigmented skin lesions. Consequently, they can serve as a reliable tool for primary care physicians to detect skin cancers such as melanoma early. However, our results should be interpreted cautiously, as the significant heterogeneity in the pooled analyses suggests variation across the studies. This could be attributed to differences in dataset characteristics, algorithm design, lesion type variations, the doctors' expertise, or variations in the reference standard used for comparison.
Our findings are consistent with those reported individually by Birkenfeld et al. [12] and Soenksen et al. [14], who found that DL-based tools could distinguish suspicious from non-suspicious pigmented lesions accurately in their respective populations. Furthermore, Tschandl et al. found that when an AI algorithm supports a physician's decision-making, the diagnostic accuracy improves over that of either AI or physicians alone [18]. These findings have also been confirmed by other researchers [19].
Despite the promising results, there are several challenges and limitations that need to be acknowledged. Publication bias may exist, as studies with significant positive results are more likely to be published. Additionally, the quality of the included studies varied, and some may have introduced biases that could affect the overall findings. Moreover, one major limitation is the lack of interpretability of these algorithms. Hence, the accuracy of these algorithms is hard to verify when they are applied without any physician input. Besides, it is hard to explain how these algorithms come to their results. ML algorithms often function as a black box [17] that takes in inputs and produces outputs with no explanation of how the conclusions were reached. This lack of transparency can limit trust and acceptance among healthcare physicians [20]. Nevertheless, our study highlights the potential benefits of integrating AI and ML technologies, specifically DL algorithms, into routine primary care practice. These technologies can potentially enhance early detection, improve patient outcomes, and alleviate the burden on dermatologists [18]. Recently, DL methods have also been explored in the non-invasive diagnosis of skin lesions and demonstrated their ability to classify skin lesions with high accuracy. However, others argue that it is difficult to integrate AI and ML technologies into daily dermatological examinations [21]. From a policy perspective, implementing large-scale skin cancer screening programs is not only likely to be a complex task but will also be infeasible in most resource-limited healthcare systems worldwide. In the United States, for example, there are fewer than 12,000 practicing dermatologists [22], and with fewer than 15 visits per 100 individuals per year [22], it is expected that most dermatology practices across the world are already too saturated and time-constrained to provide additional screening services.
Integrating DL algorithms in dermatology highlights the importance of ongoing education and training for dermatologists and primary care physicians. Continued professional development programs can help physicians stay updated with the latest technological advancements and ensure their competent use in clinical practice [20]. Collaboration between medical schools, technology companies, and hospitals can facilitate the development of specialized training programs that equip healthcare professionals with the essential skills to effectively utilize DL algorithms. A recent study by Hekler et al. found that combining humans and AI achieves a better classification of images than either dermatologists or a CNN alone [22]. The mean accuracy increased by 1.36% when dermatologists worked together with ML and AI.
Dermatologists can contribute their expertise in curating high-quality datasets, training the algorithms, and validating their performance to optimize the use of DL algorithms in primary care settings. Primary care physicians can provide valuable insights into the practical implementation of these algorithms and ensure their integration into existing healthcare workflows. Additionally, exploring how AI can be combined with other diagnostic interventions may yield synergistic effects and further enhance skin cancer outcomes.
Future research in this field should focus on refining DL algorithms to enhance their performance, reliability, and interpretability. Prospective studies with larger sample sizes and diverse patient populations are needed to validate the findings of this meta-analysis. Additionally, long-term follow-up studies can assess the impact of AI and ML algorithms on patient outcomes, including detecting early-stage skin cancers and reducing mortality rates. Human-machine collaboration has shown promising results for future applications. As the field expands, machines could assist physicians with time-consuming tasks that are often not performed in routine practice.

Conclusions
DL algorithms have the potential to significantly improve the detection of suspicious pigmented skin lesions in primary care settings. Our analysis showed that DL exhibited promising performance in the early detection of suspicious pigmented skin lesions. However, further studies are needed. Furthermore, the integration of DL algorithms into primary care could potentially reduce the number of unnecessary biopsies and referrals, thereby optimizing resource allocation and improving patient outcomes. It is also crucial to investigate the potential challenges and limitations of DL implementation in real-world clinical settings, such as data privacy concerns and the need for standardized training datasets, to ensure its safe and effective use in dermatological practice.

FIGURE 1: PRISMA 2020 flow diagram. PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

FIGURE 3: Data presents (a) a meta-analysis of sensitivity across three studies, (b) a meta-analysis of specificity across three studies, (c) a meta-analysis of positive likelihood ratios from three studies, and (d) a meta-analysis of negative likelihood ratios from three studies. Birkenfeld et al. 2020 [12], Soenksen et al. 2021 [14], and Sangers et al. 2022 [17].

A systematic literature search was conducted across multiple electronic databases, including MEDLINE via PubMed, Scopus, IEEE Xplore, Cochrane Central Register of Controlled Trials (CENTRAL), and Web of Science. The search strategies for PubMed and IEEE Xplore were as follows
1.

TABLE 1: Summary characteristics of the included studies.
DCNN: deep convolutional neural network; PCA: principal component analysis; CNN: convolutional neural network; DL: deep learning; PPV: positive predictive value; NPV: negative predictive value; NR: not reported.